Malaria Detection

Problem Definition

The context: Why is this problem important to solve?
The objectives: What is the intended goal?
The key questions: What are the key questions that need to be answered?
The problem formulation: What is it that we are trying to solve using data science?

Data Description

There are a total of 24,958 train and 2,600 test images (coloured), taken from microscopic images of cells. These images fall into the following categories:

Parasitized: The parasitized cells contain the Plasmodium parasite, which causes malaria.
Uninfected: The uninfected cells are free of the Plasmodium parasite.

Malaria is a disease that mainly affects tropical countries. It is predominant in several underdeveloped countries, and due to the effects of climate change there is concern that the disease will spread more widely as well. The disease can be cured, yet it remains one of the deadliest diseases affecting humans. I personally was infected several times during my childhood in Nigeria, and this project reminds me of my past struggles with the disease.

The intended goal of the project is to diagnose the disease in its early stages and thereby mitigate its severity and spread. This is crucial, as the disease always requires proper treatment.

Data science can play a crucial role, especially in poor countries where there is a lack of medical resources and professionals to help diagnose the disease in its infancy and prepare the patient for the necessary treatment. In developed countries, convolutional neural network models can be trained to become very efficient at early detection of the parasite and could help ease the burden on medical professionals during an emergency or a shortage of medical resources, such as we experienced during the recent COVID pandemic.

The key questions that need to be answered are how easily we can collect the data needed to train our model, how consistent and reliable the data is, and whether we can gather a large enough population for all the classes we are trying to learn. It is also important to understand any bias in the data and take action to remove it or mitigate its effect.

Mount the Drive

Loading libraries

Let us load the data

Note:

The extracted folder has separate train and test folders, each containing images of varying sizes for parasitized and uninfected cells within the respective sub-folder.

The size of all images must be the same, and they should be converted to 4D arrays so that they can be used as input to the convolutional neural network. We also need to create labels for both types of images to be able to train and test the model.

Let's do this for the training data first and then reuse the same code for the test data.

Check the shape of train and test images

Check the shape of train and test labels

Observations and insights:

There are 24,958 training labels and 2,600 test labels. We observed that the training and test images vary in pixel size, so we will have to resize them to a common size before training the model.

Check the minimum and maximum range of pixel values for train and test images

Observations and insights:

For the training images, the Red pixel values vary from 0 to 255, the Green from 0 to 244, and the Blue from 0 to 246.

For the test images, the Red pixel values vary from 0 to 255, the Green from 0 to 231, and the Blue from 0 to 215.

Analysing the pixels shows that Red is the most prominent colour in both the training and test images.
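A minimal way to compute those per-channel ranges with NumPy, assuming the images are stacked as an (n, h, w, 3) array in R, G, B channel order:

```python
import numpy as np

def channel_ranges(images):
    """Per-channel (R, G, B) min and max over a batch of uint8 images."""
    mins = images.min(axis=(0, 1, 2))  # collapse batch, height, width
    maxs = images.max(axis=(0, 1, 2))
    return {c: (int(mn), int(mx))
            for c, mn, mx in zip("RGB", mins, maxs)}
```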

Count the number of images in each class (uninfected and parasitized)

Normalize the images

Observations and insights:

We have 24,958 images across the infected and uninfected types for training and 2,600 test images of the same types for evaluating model performance. We resized the train and test images to 64 x 64, changed the data type of both to float32, and normalized the images by dividing them by 255.

Plot to check if the data is balanced

Observations and insights:

We have 12,582 training images for parasitized and 12,376 for uninfected; the test set has 1,300 images each for uninfected and parasitized. There are 206 more parasitized training images than uninfected ones, so the training data is slightly imbalanced while the test data is balanced. It is preferable to have balanced data for training, but a difference of 206 is not large enough to cause a significant reduction in accuracy. I'll train the model and evaluate the accuracy and F1-score, and if there is enough variation between precision and recall I will try balancing the data and see how it goes.
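The balance check above can be sketched like this; the label convention (0 = uninfected, 1 = parasitized) is an assumption to match to how the labels were built:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

def plot_class_balance(labels, names=("Uninfected", "Parasitized")):
    """Bar plot of the class counts; returns the counts for inspection."""
    classes, counts = np.unique(labels, return_counts=True)
    plt.bar([names[int(c)] for c in classes], counts)
    plt.ylabel("Number of images")
    plt.title("Class balance")
    return dict(zip((names[int(c)] for c in classes), counts.tolist()))
```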

Data Exploration

Let's visualize the images from the train data

Observations and insights:

The infected cell images have a patch of dark pink colour that shows the infection; the uninfected cells do not, which is evident in the 36-image subplot in the next cell below, showing both types.

Visualize the images with subplot(6, 6) and figsize = (12, 12)

Observations and insights:

Plotting the mean images for parasitized and uninfected

Mean image for parasitized

Mean image for uninfected

Observations and insights:

The mean images for the parasitized and uninfected classes are plotted above, and we can see a slight increase in colour intensity for the parasitized images compared to the uninfected ones.
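Computing a mean image for a class is a one-line reduction over the batch axis; a minimal sketch:

```python
import numpy as np

def mean_image(images):
    """Pixel-wise average over the batch: the 'typical' cell for a class."""
    return images.astype("float32").mean(axis=0)

# usage sketch (assumed variable names):
# plt.imshow(mean_image(parasitized_images) / 255.0)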

Converting RGB to HSV of Images using OpenCV

Converting the train data

Converting the test data

Observations and insights:

In the HSV version of the training images we can clearly observe the infections as lighter, near-white patches; the uninfected images do not have these patches. However, converting to HSV appears to have added more noise to the images and may not help the accuracy of our models.

Processing Images using Gaussian Blurring

Gaussian Blurring on train data

Gaussian Blurring on test data

Observations and insights:

Think About It: Would blurring help us for this problem statement in any way? What else can we try?

Gaussian blurring is effective at smoothing the image, reducing noise and fine detail; it is like viewing the image through a translucent screen. It also reduces the standard deviation of pixel values in the image.

I'll evaluate the model performance without blurring first, and later analyse the impact of retraining the best CNN model on the blurred images.

Model Building

Base Model

Note: The Base Model has been fully built and evaluated with all outputs shown to give an idea about the process of the creation and evaluation of the performance of a CNN architecture. A similar process can be followed in iterating to build better-performing CNN architectures.

Importing the required libraries for building and training our Model

One Hot Encoding the train and test labels
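The usual call for this step is keras.utils.to_categorical; a self-contained NumPy equivalent, for clarity about what the encoding does:

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Integer labels -> one-hot rows (same result as to_categorical)."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels, dtype=int)]
```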

Building the model

Compiling the model

Using Callbacks

Fit and train our Model
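Pulling the build/compile/callback/fit steps together, a minimal sketch of a base CNN; the layer sizes here are illustrative, not the exact architecture evaluated in this notebook:

```python
from tensorflow.keras import layers, models, callbacks

def build_base_model(input_shape=(64, 64, 3), num_classes=2):
    """One conv block plus a small dense head, compiled for 2-class softmax."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping as the callback, then fit (assumed variable names):
# stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
#                                restore_best_weights=True)
# history = build_base_model().fit(X_train, y_train, validation_split=0.1,
#                                  epochs=20, batch_size=32, callbacks=[stop])
```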

Evaluating the model on test data

Plotting the confusion matrix

Plotting the train and validation curves
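A small helper for those curves, assuming a Keras History.history dict; the Agg backend line is only needed outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop in a notebook
import matplotlib.pyplot as plt

def plot_curves(history_dict):
    """Plot training vs. validation accuracy per epoch."""
    fig, ax = plt.subplots()
    ax.plot(history_dict["accuracy"], label="train")
    ax.plot(history_dict["val_accuracy"], label="validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Accuracy")
    ax.legend()
    return fig
```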

Now let's build another model with a few additional layers and check whether we can improve performance, adding layers where required and altering the activation functions.

Model 1

Trying to improve the performance of our model by adding new layers

Building the Model

Compiling the model

Using Callbacks

Fit and Train the model

Evaluating the model

Plotting the confusion matrix

Analyse misclassified Parasitized predictions

Before going further, I want to analyse the misclassified infected and uninfected images. Let's look at the data to find out what the model is failing to learn.

Analyse misclassified Uninfected images

**Observations from the misclassifications**

It can be seen that for the misclassified parasitized images, the parasite is barely visible or hidden at the edges. We need to capture more such images to improve the learning process.

In the case of the misclassified uninfected images, the images do show parasite-like marks at the edges, and we need to verify that these cells are truly not parasitized. Is the data correct, or were there mistakes in the sampling process? If not, we need more information to classify these images clearly.

Plotting the train and the validation curves

Think about it:

Now let's build a model with LeakyRelu as the activation function

Let us try to build a model using BatchNormalization and using LeakyRelu as our activation function.
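One way to sketch such a model; the depth and layer widths are illustrative assumptions, not the exact architecture trained here:

```python
from tensorflow.keras import layers, models

def build_bn_model(input_shape=(64, 64, 3), num_classes=2):
    """Conv blocks with BatchNormalization and LeakyReLU activations."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3)),
        layers.BatchNormalization(),  # normalize pre-activations
        layers.LeakyReLU(0.1),        # small slope for negative inputs
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3)),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.1),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64),
        layers.LeakyReLU(0.1),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```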

Model 2 with Batch Normalization

Building the Model

Compiling the model

Using callbacks

Fit and train the model

Plotting the train and validation accuracy

Evaluating the model

Observations and insights:

CNN Model #3 is a complex model with 93K trainable parameters; it uses dropout and batch normalization in both the convolutional and dense layers and a larger number of neurons than Model #2, but there is no significant improvement in accuracy (98.50% for Model #2 on test data versus 98.35% for Model #3 on the same data). This is expected given the data issues we saw earlier, and we will analyse the misclassifications for this model.

Model #3 seems to generalize better on test data, but Model #2 has a better recall (99%) for parasitized cells, which means it misses fewer infected cells: it misclassified only 15 out of 1,300 infected cells, which is very good.

Generate the classification report and confusion matrix

Let's analyse the incorrect predictions for parasitized cell images

Let's analyse the incorrect predictions for uninfected cell images

It can be seen again that the misclassified uninfected images do show parasite-like marks similar to those in the actual infected cells. This could confuse the model and impede the learning process.

Think About It :

Model 3 with Data Augmentation

Use image data generator
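A sketch of the generator setup; the specific augmentation parameters below are assumptions to tune, not the notebook's exact settings:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Cell images have no inherent orientation, so flips and small rotations
# are safe augmentations (assumption about this dataset):
datagen = ImageDataGenerator(rotation_range=20,
                             zoom_range=0.1,
                             horizontal_flip=True,
                             vertical_flip=True)

# Usage sketch: stream augmented batches during training
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20)
```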

Think About It :

Visualizing Augmented images

Observations and insights:

Building the Model

Using Callbacks

Fit and Train the model

Evaluating the model

Plot the train and validation accuracy

Plotting the classification report and confusion matrix

Now, let us try to use a pretrained model like VGG16 and check how it performs on our data.

Pre-trained model (VGG16)
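One way to wire VGG16 up as a frozen feature extractor with a small dense head; the head sizes are illustrative, and weights="imagenet" should be passed for actual transfer learning (None is the default here only so the sketch builds without downloading weights):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_model(input_shape=(64, 64, 3), num_classes=2, weights=None):
    """VGG16 convolutional base (frozen) plus a small classification head."""
    base = VGG16(include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # freeze the pre-trained features
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```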

Compiling the model

Using Callbacks

Fit and Train the model

Plot the train and validation accuracy

Observations and insights:

What can be observed from the train and validation curves? The training accuracy stays steady after 15 epochs, but the validation accuracy varies by 2.5% to 5%. The model might generalize better if we increase the "patience" in the callback, but the accuracy does not improve much.

Evaluating the model

Plotting the classification report and confusion matrix

Train our best model (CNN Model 2) using Gaussian blurred images:

Observations and insights:

It is interesting to find that Gaussian blurring improved the recall for infections: the model misclassified only 8 out of 1,300, a recall of 99%. Let us visualise the misclassified images.

Train our best model (CNN Model 2) using HSV images:

Think about it:

What observations and insights can be drawn from the confusion matrix and classification report? Choose the model with the best accuracy scores from all the above models and save it as a final model.

Observations and Conclusions drawn from the final model:

The final model of choice is CNN Model No. 2. The model gave an accuracy of 98.38% on test data and a recall of 99% for parasitized and 98% for uninfected images, which indicates that the model rarely misses an infection.

CNN Model No. 4, which uses data augmentation, generalizes well between training and test data but has a recall of 97% and is slightly less accurate than Model No. 2. It is also a candidate model because its variance in test performance is better than all the other models', but its recall needs to be improved.

Model No. 5 used VGG16 for transfer learning, but it did not match the accuracy of the simpler CNN models and takes a long time to train.

Improvements that can be done:

Can the model performance be improved using other pre-trained models or different CNN architectures? One could also try building a model using the HSV images and comparing it with the other models.

Insights

Refined insights:

What are the most meaningful insights from the data relevant to the problem?

Comparison of various techniques and their relative performance:

How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

Proposal for the final solution design:

What model do you propose to be adopted? Why is this the best solution to adopt?

The dataset provided had enough samples for both classes, even though the uninfected set is slightly smaller than the infected one. I evaluated the precision and recall for both classes to study whether this imbalance causes significant variation in the scores, but saw a difference of only 1%. This is also supported by the F1-score of 98%.

The dataset was easy to train on quickly using simple CNN models without much complexity: the loss dropped rapidly after each epoch, and the models reached very good accuracy early in training. The parasite is visible in almost all of the infected cells.

I analysed the best model's wrong predictions on the test dataset for both the parasitized and uninfected classes and found that some test images truly labelled as uninfected had traces of parasites and did not look clean enough to be classified as uninfected. These inconsistencies could contribute to the misclassification of infected images as well.

For this problem, we want to reduce the number of false negatives for infections, as it is dangerous to misclassify an infection as uninfected and leave the patient vulnerable. I therefore recommend a model that gives the best recall for infections, very good overall accuracy, and good generalization to the test dataset. The chosen model has an F1-score of 98%, which is very good, with little variation between precision and recall.

I trained several CNN models and found one (CNN Model No. 2) that gave the best recall for infections along with a good accuracy and F1-score. I also applied Gaussian blurring and converted the training images to HSV to evaluate the impact of image preprocessing on the best-performing model.

The recall score for infections consistently improves with Gaussian blurring, but overall accuracy suffers because many uninfected cells are misclassified as infected.

Converting the images to HSV performed very poorly on the validation and test datasets, as it increased the noise in the images and the model could not train well.

It will be necessary to retrain the models as more data becomes available so they can learn from the new data and further improve accuracy and recall. We examined the misclassified test images to better understand the infected and uninfected classes and to investigate whether errors were made in the data sampling process; images truly labelled as uninfected did have patches that may indicate the presence of a parasite, and vice versa.

It is important to discuss these findings with our stakeholders to get more clarity about the images used for training; as we know, if we feed "garbage" in, we get "garbage" out. Once we are confident about the quality of the data, we can use better image preprocessing techniques and data augmentation to get accurate predictions from a much simpler model.